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points  may  be  decomposed  into  two  decoupled  tetsks:  first,  the  computation  of  the  locally 
smooth  models,  and  hence,  the  segmentation  of  the  data  into  classes  that  consist  on  the 
sets  of  points  best  approximated  by  each  model,  and  second,  the  computation  of  the 
normalized  discriminant  functions  for  each  induced  class  (which  may  be  interpreted  as 
relative  probabilities).  The  approximating  function  may  then  be  computed  as  the  optimal 
estimator  with  respect  to  this  measure  field. 

For  the  first  step,  we  propose  a  scheme  that  involves  both  robust  regression  and  spatial 
localization  using  Gaussian  windows.  The  discriminant  functions  are  obtained  by  fitting 
Gaussian  mixture  models  for  the  data  distribution  in  the  boundary  of  each  class.  We 
give  an  efficient  procedure  for  effecting  both  computations,  and  for  the  determination  of 
the  optimal  number  of  components.  Examples  of  the  application  of  this  scheme  to  image 
filtering,  surface  reconstruction  and  time  series  prediction  are  presented  as  well. 
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1  Introduction 

This  paper  addresses  the  problem  of  approximating  a 
function  r  :  R'^  ■-»  Q,  given  a  finite  set  of  data  points 
(examples)  {(i,,  z*).  €  R'^,  z,  €  Q,  i  =  1, .  ,  A'}  The 

set  Q  may  be  a  finite  set,  as  in  pattern  classification 
problems,  or  may  coincide  with  the  real  numbers  (as  in 
function  approximation,  regression  and  filtering) 

A  natural  way  of  approaching  this  problem  is  to  fit 
simple  bead  models  to  the  data,  which  are  then  put  to¬ 
gether  using  appropriate  weight  functions.  The  fitting 
process,  for  each  local  model,  involves  first,  the  assign¬ 
ment  of  a  relative  weight  to  each  data  point  that  indi¬ 
cates  its  importance  for  the  determination  of  the  model, 
tuid  second,  the  computation  of  the  parameters  that  min¬ 
imize  an  appropriate  weighted  error  measure 

These  two  steps  may  be  carried  out  in  a  decoupled 
fashion,  by  making  the  weights  depend  only  on  the  spa¬ 
tial  distribution  of  the  independent  variables  {x,}  Ex¬ 
treme  instances  of  this  strategy  are  the  local  learning 
algorithms  proposed  by  Vapnik  [1],  where  a  different  lo¬ 
cal  model  is  used  for  each  point  x  in  which  the  function 
IS  to  be  evaluated,  and  the  model  parameters  are  made 
to  depend  only  on  those  data  points  that  are  closest  to 
X.  Another  example  may  be  found  in  the  use  of  Radial 
Basis  Functions  with  fixed  knots  for  function  approxima¬ 
tion;  in  this  case,  the  location  of  the  knots  (and  hence, 
the  relative  data  weights)  are  determined  as  a  function  of 
the  distribution  of  {ii},  for  example,  using  the  K-Means 
algorithm  [15], 

This  decoupled  scheme,  although  computationally 
simple,  does  not  take  into  consideration  the  structure  of 
the  function  itself,  and  therefore,  will  lead  to  relatively 
complex  local  models,  so  that  the  approximation  will  be, 
in  genertd,  difficult  to  interpret;  if  the  local  models  are 
to  reflect  the  structure  of  the  function,  the  data  weights 
must  depend  also  on  the  goodness  of  fit 

A  simple  — although  computationally  expensive — 
way  of  doing  this  is  to  specify  parametric  models  for 
both  the  weights  and  the  local  regressi-ma,  aiid  then  ad¬ 
just  the  parameters  in  a  coupled  way  by  minimizing  an 
appropriate  cost  function  Thus,  for  example,  in  [16],  the 
cost  function  is  the  log-likelihood  of  a  statistical  mix¬ 
ture  model,  which  is  minimized  using  gradient  descent, 
the  local  models  ( “experts” )  are  multi-layer  perceptions, 
and  the  weights  (“gating  networks”)  are  linear  functions 
followed  by  an  exponential  transformation  and  normal¬ 
ization  (“softmax  units")  The  number  of  components 
is  found  by  setting  it  to  a  sufficiently  large  number,  and 
deleting,  upon  convergence,  those  components  with  in¬ 
significant  weights 

The  interpretability  of  the  approximation  increases  if 
the  local  models  are  given  a  simpler  form.  Also,  it  is 
computationally  more  efficient  to  start  with  a  single  com¬ 
ponent.  and  successively  add  new  ones  by  splitting  the 
domain  of  the  one  with  worst  local  fit  until  the  global 
fit  error  is  below  a  given  threshold.  This  is  done,  for 
example,  in  Recursive  Partition  Regression  (RPR)  [2], 
where  the  local  models  are  low-order  polynomials  and 
the  weights  are  step  functions  on  each  one  of  the  inde¬ 
pendent  variables. 

The  problems  with  this  approach  are  that  the  approx¬ 


imating  functions  are  always  discontinuous,  and  the  dis¬ 
continuities  are  always  piecewise  parallel  to  the  coor¬ 
dinate  axes,  so  that  continuous  functions,  or  piecewise 
smooth  functions  with  complex  discontinuity  shapes  are 
not  well  approximated  This  is  remedied  in  the  MARS 
approach  [5],  by  replacing  the  step  functions  by  trun¬ 
cated  power  spline  basis  functions,  and  by  not  remov¬ 
ing  the  parent  basis  functions  after  they  are  spht  The 
problem  here  is  that  now  the  resulting  approximation  is 
aways  globally  smooth  (so  that  discontinuous  functions 
are  not  well  approximated),  and  that  the  representation 
can  no  longer  be  expressed  as  a  sum  of  spatially  localized 
simple  models,  which  has  a  high  intuitive  appeal. 

The  objective  of  this  work  is  to  overcome  some  of 
these  limitations;  specifically,  our  goal  is  to  develop  a 
representation  that  is  :  easy  to  interpret;  efficient  to 
compute,  and  flexible,  in  the  sense  that  it  can  generate 
both  smooth  auid  piecewise  smooth  approximations  It 
is  based  in  the  following  ideas: 

The  most  flexible  representation  has  the  form  of  a 
measure  field,  which  is  defined  by  the  following  elements 

i)  A  relatively  small  set  of  simple  functions,  {i* 
R"*  —  Q.  t  =  I  ,m}  (the  local  models) 

ii)  A  probabilistic  description  of  the  regions  where 
each  local  model  is  a  good  approximation  of  the 
function,  that  is,  a  set  of  functions  {pt  :  R”* 

[0, 1],  it  =  1,  ,  m}  such  that  Pk{x)  =  1.0  for 

all  X,  where  pt(i)  represents  the  probability  that 
the  function  is  optimally  approximated  by  at 
location  x. 

Th»  pairs  {(pt,i*)}  define  a  discrete  measure  field  p 
that  represents  the  function; 

m 

t=i 

where  6{  )  is  the  usual  delta  function. 

One  may  specify  different  criteria  for  computing  the 
approximation  i  given  the  measure  field  (1).  For  exam¬ 
ple,  a  smooth  function  may  be  obtained  by  taking  the 
mean  value: 

/m 

zdpi(2)  =  ^Pi(i)zt(z)  (2) 

t=l 

and  a  piecewise  smooth  one  by  taking  the  mode: 

j(j-)  =  Zr(x)  ,  r  =  argmMpt(x)  (3) 

It  is  also  possible  to  include  additional  information,  such 
as  prior  conditions  on  the  shape  of  the  smooth  patches 
or  on  the  location  of  the  discontinuities,  and  compute  i 
as  the  optimal  Bayesian  estimator  [12] 

This  type  of  representation  has  been  proposed  in  the 
context  of  machine  learning  [16]  [8],  and  also  for  the  so¬ 
lution  of  a  class  of  computation ril  vision  problems  [12] 
It  will  be  adequate  if  the  models  for  the  functions  z  and 
p  are  intuitively  appealing,  and  allow  for  a  computation¬ 
ally  efficient  implementation  Here,  we  will  use  for  the 
local  models  polynomials  of  low  order  (0  or  1),  so  that 


piecewise  constant  and  piecewise  linear  functions  20-0  ex¬ 
actly  represented.  The  discriminant  functions  may  be 
modeled  as  normalized  sums  of  Gaussians: 


Pk(x) 


(4) 


where:  {Gij(  )}  are  Gaussians;  /?  is  a  positive  number 
that  controls  the  siiioothucss  uf  the  apprc.Xiir.a.tivjn,  and 
Uic  is  the  number  of  Gaussians  that  model  the  distribu¬ 
tion  of  class  k. 

For  the  computation  of  the  corresponding  parameters, 
we  propose  a  strategy  that  combines  the  computational 
efficiency  of  decoupling  the  weight  and  model  computa¬ 
tion  with  the  pc  wer  of  making  the  weights  dependent  on 
the  local  fit.  The  scheme  is  as  follows:  compute  the  local 
models  in  a  first  step  using  weights  that  depend  on  both 
X  and  z,  that  is,  using  local  robust  regression;  in  a  sec¬ 
ond  step,  classify  the  data  according  to  the  model  that 
best  approximates  the  function  at  each  point,  and  finally 
compute  the  p  functions  as  the  discriminants  for  this  in¬ 
duced  classification  task  (for  classification  problems,  of 
course,  the  first  2  steps  are  not  needed) 

The  number  of  components  is  determined  by  a  strat¬ 
egy  similar  to  that  of  RPR;  first,  components  are  added 
one  at  a  time,  until  the  error  variance  reaches  an  appro¬ 
priate  value.  In  a  subsequent  "backwards’’  step,  redun¬ 
dant  components  are  merged  (in  the  regression  phase) 
or  deleted  (in  the  classification  ph?«'‘) 

What  makes  this  methodology  attractive  is  the  ex¬ 
istence  of  efficient  algorithms  to  perform  the  compu¬ 
tations.  In  the  next  section  we  will  show  that  by  us¬ 
ing  Gaussian  windows,  local  robust  regression  becomes 
equivalent  to  the  estimation  of  the  parameters  of  a  Gaus¬ 
sian  mixture  model  (like  the  one  used  in  the  classifica¬ 
tion  stage),  and  that  this  problem  is  equivalent  to  vec¬ 
tor  quantization  with  a  quadratic  distortion  measure,  for 
which  efficient  algorithms  exist  Since  the  discriminant 
functions  in  the  classification  phase  are  also  modeled  as 
Gaussian  mixtures,  a  single  computational  framework 
may  in  fact  be  used  for  both  steps. 


2  Gaussian  Mixture  Models  and  the 
Generalized  K-Means  Algorithm 

Suppose  we  have  N  data  points  in  {i, ,  i  =  1, . . . ,  A'}, 
emd  we  want  to  approximate  their  distribution  with  n 
codewords  or  “centers”  {ruk  G  TV‘,k  =  1,  .  ,n}  that 

minimize  a  distortion  measure  of  the  form: 

A'  n 

f  ^  ^d(r.,T71t)s,l:  (.5) 

i=I  i-=I 

where  d{xi,Tnk)  is  a  local  distortion  measure,  such  as 

d(i,fc)  =  III,  -  mill^  (6) 

The  indicator  variables  s,)..  are  given  by: 

Sik  =  Is.(xi) 

with  I5  denoting  the  indicator  function  of  set  5.  and 
where  the  sets  St  consist  on  those  data  points  that  are 
mapped  into  codeword  it 

It  is  possible  to  show  [11]  that  the  following  (K- 
Means)  algorithm  will  never  increase  S. 


1:  for  t  =  1, .  .  ,  n,  assign  to  each  a  random  lo¬ 
cation  ; 

2  at  time  t,  set 

sj''  =  1  ,  if  d(x,,m^‘^)  <  d(i,mj'^)  ,  j  ^  k 

=  0  ,  otherwise 


(with  an  appropriate  rule  for  solving  the  ties)  and 

N  n 

m*^*'*'*^  =  argmm^^d(x,.ti)s^^i^  (7) 

i=i i=i 

for  example,  if  d(  ,  )  is  given  by  (6),  we  get 


EVl , 

„  I  ^»^it 


(I) 


(«) 


Alternatively,  one  may  consider  the  data  as  samples 
from  a  mixture  of  populations  {Sk,k  =  1,  —  n},  each 
one  normally  distributed  with  mean  m*  and  identical 
covariances  ^7 

Assuming  that  for  each  i  the  indicator  variables  for 
these  populations,  are  independently  and  identi¬ 

cally  distributed  according  with  a  multinomial  distribu¬ 
tion  consisting  on  one  equiprobable  draw  on  n  categories, 
the  conditional  log  density  for  data  point  x,  is  thus: 

n 

logp(i,  I  s)  =  -/7^s,tl|x,  -  mtiP  ~tr  K 

*=i 

where  K  is  a  constant. 

Assuming  that  the  data  x  =  {x;}  given  s  =  {sjt}  are 
conditionally  independent,  we  get  that  the  log  likelihood 
for  the  complete  data  x  and  s  is: 


n  N 

C{m)  =  -  mfcll^  +  NK  (8) 

ir=l  t  =  l 

We  may  now  find  the  maximum  likelihood  estimator 
for  m  =  {mt }  by  treating  s  as  missing  data  and  applying 
the  EM  algorithm  [4]  [7]:  in  the  expectation  (E)  step 
at  time  t,  the  expected  likelihood  <  E  >  is  found  by 
replacing  the  variables  {s,*}  in  (8)  with  their  conditional 
expected  value,  given  x  and  m*').  This  is  found  using 
Bayes  rule 

E[s,t  I  i.m''*]  =  =  Pr(i,  £  Sk  |  i, ,m*'^)  = 

exp[-g||x,  - 

Er=i  exp[-/7|li.  -  mj."||2] 

In  the  M  step,  we  find  the  va'ue  of  m  that  maximizes 
<  £  >  : 


=  arg  min  w\ 

mfc  •  ^ 


2 


In  the  limit  as  /J  — >  oo,  each  indicator  variable  is  replaced 
by  the  mode  of  the  conditional  distribution  (which  co¬ 
incides  with  the  mean)  in  the  E  step,  and  so,  the  EM 
algorithm  reduces  to  the  K-Means  scheme.  For  finite 


k 


* 


u 


I 


I 


values  of  0,  it  minimizes  a  distortion  measure  that  is 
obtained  &om  the  log  likeliiiouu  of  the  daia: 

N  /  n 

f  =  ^log  I  ^exp[-^||i,  -  mtip] 

i=l  \,t  =  i 

In  przuitice,  the  behavior  of  both  cdgorithms  is  very  sim¬ 
ilar,  although  it  has  been  reported  that  the  EM  scheme 
tends  to  produce  more  uniform  center  distributions  [16] 
If  one  uses  the  EM  algorithm,  one  can  estimate,  upon 
convergence,  the  indicator  variables  s,*  by 

Sik  =  1  ,  if  ui.i  >  Wij  for  all  j 

=  0  ,  otherwise  (9) 

and  thus  define  the  domaiin  for  each  center  k  as  the  set 
of  points  Xi  for  which  =  1. 

Suppose  now  that  in  addition  to  the  data  points  {x,} 
we  are  given  the  corresponding  values  {x,  }  of  a  function 
that  we  want  to  approximate.  The  problem  now  is  to 
find  not  only  the  set  of  centers  {m^}  that  represents 
the  data,  but  also  to  associate  with  each  one  of  them  a 
local  model  it(r; 5^)  that  represents  the  function  z  with 
minimum  distortion  in  the  domain  (Voronoi  polytope)  of 
the  corresponding  center.  This  means  that  we  now  have 
to  minimize  a  distortion  measure  of  the  form: 


N  n 

e  =  ]^log^exp-  [3  [Iji.  -  mtlp  +  A4>(x.,z.;ei.)|] 

i=l  i=l 

(10) 

where:  $  measures  the  fit  error  of  the  model  given  by 
the  parameter  vector  8k  (and  possibly  additional  prior 
contraints  on  this  vector),  and  A  is  a  parameter. 

To  derive  the  generalized  K-Means  (GKM)  aJgorithin 
in  this  case,  we  consider  a  mixture  model  for  the  joint 
distribution  of  x  and  z  and  obtain  the  GKM  scheme  as 
the  limit,  as  ^  •  oo  of  the  EM  algorithm  that  estimates 
the  vector  (m,  9)\  the  main  iteration  tak“s  now  the  form 
1.  E  Step:  compute 


Ul 


(‘)  _ 
it  — 


gt(i.) 

Er  Sr(x.) 


where 

Gk(xi)  =  exp  |-/9[|lx.  -  +  A$(i, ,  x,;  0^'')]j 

which  in  the  GKM  case  reduces  to: 

“’it*  =  1  •  if  ^t(A)  >  Sr(x,),r  ^  k 

=  0  .  otherwise 

2.  M  Step:  Set 

=  argrnin^tt't’  (l|i.  -  u||-  -I-  A<^(x, ,  x, ;  t»)) 

t 

The  choice  of  the  function  is  determined  by  the  de¬ 
sired  behavior  of  the  centers,  and  also  by  computational 
considerations:  if  it  is  cpiadratic  in  8,  the  maximization 
steps  in  the  above  algorithms  are  implemented  simply 
by  matrix  inversion.  We  now  discuss  the  specific  choices 
that  are  relevant  for  the  determination  of  the  local  mod¬ 
els  and  for  the  classification  stage. 


2.1  Local  Robust  Regression 

In  this  case,  a  natural  choice  for  4  is  the  quadratic  fit 
error;  if  the  local  models  sue  linear,  so  that  Zk  =  Ok  ■  z  + 
bk,  we  get: 

^(x:,z;8k)  =  {x-ak  z-bkf  (11) 

The  function  may  be  modified  to  enforce  some  prior 
constraint;  for  example,  to  give  higher  preference  to  hor¬ 
izontal  planes  (i.e.,  for  local  ridge  regression  [22])  we  may 
put 

^{i,z.9k)  -  {z  -  Ok  X  -  bkf  +  I'llntlP 
With  this  choice,  the  M  step  of  the  GKM  (or  EM) 
algorithms  is  equivalent  to  the  solution  of  a  decoupled 
set  of  linear  systems  (one  for  each  local  model)  whose  size 
grows  linearly  with  the  dimension  of  the  input  space 
The  solutions  found  by  these  algorithms  may  be  un¬ 
derstood,  in  terms  of  robust  regression,  as  the  best  n 
hyperplanes  that  pass  through  the  data,  with  the  con¬ 
straint  (enforced  by  the  ||x  —  terms)  that  the  sup¬ 

port  for  each  hyperplane  must  be  spatially  localized;  this 
constraint  is  crucial  to  get  good  results,  since  it  prevents 
the  system  from  finding  suboptimal  solutions  with  non¬ 
local  support  The  relative  weight  of  this  constraint  is 
controlled  by  the  parameter  A,  if  its  value  is  too  high, 
the  non-local  solutions  will  prevail,  while  if  it  is  too  low. 
the  location  of  the  centers  will  be  controlled  solely  by  the 
data  density.  The  solution,  however,  is  not  very  sensitive 
to  the  precise  value  of  this  parameter;  if  one  scales  the 
data  so  that  all  xi  are  inside  the  unit  hypercube,  and  all 
X,  are  in  the  interval  [0, 1],  a  value  of  A  between  1  and  5 
gives  good  results. 

The  nature  of  the  solution  also  depends  on  the  num¬ 
ber  of  local  models:  there  should  be  enough  of  them  for 
the  system  to  escape  locaJ  minima  that  correspond  to 
non-locally  supported  hyperplanes  On  the  other  hand, 
if  there  are  too  many,  one  would  be,  in  effect,  modeling 
the  noise  (i.e.,  overfitting  the  data)  with  the  correspond¬ 
ing  degradation  of  the  generalization  capabilities  of  the 
system. 

This  trade-off  between  model  capacity  and  generaliza¬ 
tion  error  has  been  widely  recognized,  and  criteria  dif¬ 
ferent  from  the  minimization  of  the  empirical  risk  have 
been  proposed  to  find  an  optimal  compromise  (e.g.,  min¬ 
imization  of  the  structural  risk  [1];  minimum  description 
length  criteria  [20]  or  cross  validation  techniques  [3]) 

In  this  case,  due  to  the  simplicity  of  the  local  models, 
one  may  use  directly  as  a  criterion  an  approximation  to 
the  expected  value  of  the  fit  error.  In  [6]  it  is  shown  that 
thix  expected  value  may  be  decomposed  into  a  bias  and 
a  variance  terms: 

Fp(x)-x(x))-]  =  (£i(x)-x(x))^  +  E[(x(x)-Ef(x))^] 

If  x(  )  IS  piecewise  linear,  and  there  are  enough  lo¬ 
cal  models,  the  corresponding  piecewise  linear  regression 
x(x)  IS  unbiased,  so  that  the  mean  fit  error  is  controled 
by  the  variance.  For  each  local  model,  the  variance  of 
the  predicted  mean  response  zt(p),  for  any  point  p  inside 
its  Voronoi  polytope  Sk,  may  be  estimated  as  [21]: 

^'k(p)  =  Var(it(p))  =  tr^(~+(p-Xkf(Xj'Xkr'(p-Xk)) 
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where  D  is  the  dimension  of  the  input  space  and  the 
(nt  X  D)  matrix  Xk  whose  rows  are  (i,  -  it)^  for  each 
I,  e  S*. 

Equation  (12)  allows  one  to  estimate  the  reliability  of 
the  approximation  at  any  point;  therefore,  one  may  use 
it  to  estimate,  in  addition  to  the  reconstructed  surface, 
a  corresponding  ‘‘error  map” . 

The  optimal  number  of  models  is  determined  by  mini¬ 
mizing  the  average  variance.  This  may  be  approximated 
by  the  mean  variance  at  the  centers  {r*}  of  the  Voronoi 
polytopes;  averaging  Vt(xt)  over  all  K  centers,  and  not¬ 
ing  that  nt  ss  N/K ,  w“  get; 


V\-  «  AVV-V 


where 

=  E 

k=l 

is  the  empirical  risk 

The  complete  strategy,  therefore  consists  in  adding 
one  model  at  a  time  (e  g  ,  by  splitting  the  center  with 
worst  local  fit  error)  until  a  maximum  number  Kmar  is 
reached,  and  then  pick  the  solution  that  corresponds  to 
the  number  of  centers  K  that  minimizes  V/< .  The  num¬ 
ber  Kmax  is  related  to  the  minimum  number  of  examples 
that  reliably  support  a  hyperplane,  and  may  depend  on; 
the  dimension  of  the  input  space;  the  total  number  of  ex¬ 
amples  amd  the  noise  power.  Although  it  is  convenient 
to  choose  it  in  a  reasonable  way,  its  precise  value  is  not 
critical  for  the  performance  of  the  system. 

Since  the  models  are  locally  supported,  it  is  possible 
that  several  of  them  correspond  in  fact  to  the  same  — or 
very  similar —  hyperplanes  For  this  reason,  it  is  nec¬ 
essary,  once  a  solution  is  found,  to  merge  together  the 
redundant  components  To  do  this,  one  should  find  an 
appropriate  way  of  measuring  how  different  two  hyper¬ 
planes  are.  It  may  be  shown  [19]  that  given  the  usual 
normality  assumptions,  the  relative  MSE  reduction  ob¬ 
tained  by  replacing  the  two  hyperplanes  by  a  single  one 
(with  the  appropriate  corrections  to  compensate  for  the 
different  number  of  parameters)  has  a  standard  F  dis¬ 
tribution  2uid  hence,  it  may  be  used  for  testing  the  cor¬ 
responding  hypothesis.  In  particular,  for  each  pair  of 
models  {j,k)  one  computes  the  statistic: 


-  (55;t-g5,-55t  )/(£>+!) 

(SS, +  SSt)/(ti, -fnt-2(D  +  l)) 
where  D  is  the  dimensionality  of  the  input  space; 
SSj  =  ^  (z.  -  z^(r,))- 

ss,k  =  Y, 

t:x ,  C-S'y  uSfc 


where  ijt  is  the  best  (LSE)  hyperplane  through  the 
points  in  Sj  U  Sk 

Each  merging  step  consists  in  finding  the  models  (j,  i) 
with  the  smallest  value  of  Fjk  -  this  statistic  is  then  com¬ 
pared  with  the  tabulated  value  of  F  with  (D-f  1)  and 
(rij  -Pn*  —  2(Z?-)- 1))  degrees  of  freedom  at  the  appropriate 
confidence  level;  if  it  is  smaller,  the  2  models  are  replaced 
by  Zjic  and  the  whole  process  is  repeated  until  the  null 
hypothesis  (equality  of  the  regression  hyperplanes)  is  re¬ 
jected  for  the  smallest  F. 

This  test  may  fail  if  the  normality  assumption  is 
grossly  violated  (although  we  have  found  that  it  is  rel¬ 
atively  robust)  In  these  cases  it  may  be  convenient  to 
also  put  an  upper  bound  on  the  number  of  linear  compo¬ 
nents  that  one  wants  to  include  in  the  complete  model, 
so  that  the  merging  process  stops  only  if  the  smallest  F 
is  too  large  and  the  total  number  of  components  is  below 
this  bound. 

After  the  merging  process  is  completed,  one  associates 
with  each  data  point  r,  a  class  c,  that  corresponds  to  the 
model  that  contains  the  center  k  for  which  the  indica¬ 
tor  variable  s,i  =  1  To  compute  the  desired  measure 
field  representation,  one  now  has  to  find  the  discriminant 
functions  for  the  data  {(r,,c, ),i  =1  N) 


2.2  Local  Gaussian  Classifiers 

Consider  the  problem  of  finding  discriminant  functions 
that  optimally  classify  the  data:  {(i,,c,),i  =  l  A^}. 
c,  €  {1,  -  ,  Q)  for  all  t  A  simple  way  of  doing  •<= 
to  approximate  the  data  distribution  within  each  class 
c  with  a  sum  of  Gaussians  whose  centers  are  found 
again  with  the  GKM  algorithm  [13]  This  may  be  ob¬ 
tained  from  the  above  model  by  setting  A  to  a  very  large 
value,  and  to: 


4>(x,  z;mtc)  =  1  -  ^(^(*)  -  c) 


where  is  the  center  of  the  k‘^  Gaussieui  that  approx¬ 
imates  the  data  distribution  of  class  c 

One  may  now  define  a  1-parameter  family  of  discrim¬ 
inant  functions  (indexed  by  the  positive  parameter  0) 
as 


Pcix) 


Etli 

Ef=iE:;iGM(x)^ 


(13) 


where 


Gkc(r)  =1  Akc  I  '''•  exp[-i(r  -  rrikcf  A^^{x  -  rrikc 


The  matrices  Akc  (|  Akc  (  denotes  their  determinants) 
may  all  be  taken  equal  to  the  identity  (in  which  case  this 
procedure  becomes  similar  to  the  LVQl  method  [9]),  or 
they  may  be  computed  as  the  covariance  matrices  inside 
the  corresponding  Voronoi  polytope  (see  [13]): 


Ak,  =  —  Y  “  '^tc)(xi 

”tc  ^ 


rrikc) 


(this  is  usually  important  in  high  dimensions  with 
anisotropic  data). 

For  high  values  of  0,  the  classifier  simply  assigns  to 
each  point  the  class  of  the  closest  center  (with  a  distance 
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weighted  by  A  matrices)  As  3  decreases,  the  discrimi¬ 
nant  functions  — and  hence  the  inter-class  boundaries — 
become  smoother,  so  that  3  controls,  in  some  sense,  the 
trade-off  between  capacity  and  generalization  capabili¬ 
ties. 

If  the  classifier  is  trained  in  such  a  way  that  it  achieves 
(near)  perfect  classification  of  the  data  with  high  3.  this 
parameter  may  then  be  used  to  fine-tune  the  classifier 
(using,  for  example  cross-validation)  to  attain  low  gen¬ 
eralization  errors. 

Training  the  classifier  means  finding  the  minimal 
number  of  centers  that  achieve  a  given  classification  er¬ 
ror.  For  this,  one  may  use  the  following  procedure:  start 
with  one  center  per  class;  upon  convergence  of  the  K- 
Means  procedure,  compute  the  approximate  classifica¬ 
tion;  add  a  new  center  at  a  location  that  corresponds 
to  a  miscltissified  data  point  in  the  class  with  highest 
classification  error,  and  repeat  the  K-Means  process 

Once  the  target  approximation  accuracy  (or  the  maxi¬ 
mum  number  of  allowable  centers)  is  reached,  one  should 
“prune”  the  irrelevant  centers  (those  sample  the  interior 
of  a  class,  and  therefore  do  not  contribute  to  the  cor¬ 
rect  classification  of  any  point).  More  precisely,  a  center 
mtc  IS  pruned  if  its  irrelevance  function  I{k.c)  equals  1 
/( iS-  c)  IS  defined  as : 


/(I:,c)  =  <5(  ^  (1  -  6(c  -  itc(x,)) 

\i.i(x.)=t 

where  i(xi)  is  the  current  estimate  for  the  class  of  r,, 
and  i*«(Xi)  is  the  class  that  would  be  estimated  if  mi-^ 
was  deleted.  Note  that  for  high  3  one  needs  to  consider 
in  the  sum  only  those  points  in  the  Voronoi  polytope  of 
mtj,  and  Zie(x.)  is  simply  the  class  associated  with  the 
second  nearest  center  to  x, . 

In  practice,  it  is  possible  to  prune  the  irrelevant  cen¬ 
ters  dynamically  by  computing  I{k,c)  for  all  centers  at 
each  iteration;  if  a  center  is  found  to  be  irrelevant,  it  is 
pruned  and  all  the  data  points  inside  its  Voronoi  poly¬ 
tope  are  marked  as  inactive  In  this  way,  one  works  with 
less  auid  less  data  points,  accelerating  the  whole  process 
(provided  that  at  each  iteration  one  checks  the  inactive 
points  and  includes  again  those  that  are  misclassified) 
We  have  found  that  in  general  this  procedure  gives  the 
same  results  as  the  one  described  above,  except  when 
there  is  a  very  large  inter-class  overlap  (claissification 
noise),  when  the  dynamic  pruning  may  fall  into  a  limit 
cycle. 

An  interesting  by-product  of  this  pruning  procedure 
IS  a  secondary  class  that  may  be  associated  with  each 
data  point,  namely,  an  indicator  of  its  relevance  or  ir¬ 
relevance  for  the  primary  classification  task  One  may- 
then  construct  a  binary  classifier  for  this  secondary  class 
which  may  be  useful  for  guiding  the  recollection  of  new 
data,  particularly  when  this  involves  complex  and  ex¬ 
pensive  procedures. 

3  Examples 

In  this  section  we  present  some  examples  that  illustrate 
the  performance  of  these  techniques. 


The  first  set  of  examples  deal  with  2-dimensional 
problems  The  first  one  shows  that  the  system  cor¬ 
rectly  identifies  the  piecewise  linear  structure  when  it 
is  present:  panel  (a)  of  figure  1  shows  a  function  x(x,y) 
that  is  equal  to  a  tilted  plane  inside  an  L-shaped  re¬ 
gion  and  IS  constant  elsewhere  The  input  to  the  sys¬ 
tem  consisted  on  200  triplets  {(xi,  y.,  Xi)},  where  each 
(x, .y,)  IS  a  randomly  selected  point  on  the  square  and 
X,  =  z(z,,y,)  +  n,,  where  {n;}  is  a  set  of  independent, 
identically  distributed  Gaussian  random  variables  with 
zero  mean  and  variance  equal  to  0.05  In  step  1 ,  the  algo¬ 
rithm  found  5  local  models,  which  after  the  merging  step 
were  reduced  to  2  The  second  step  was  thus  a  binary- 
classification  problem  for  which  perfect  classification  was 
achieved  with  26  Gaussians  (with  covariance  matrices 
equal  to  the  identity),  which  after  pruning  were  reduced 
to  17  The  reconstructed  surface,  obtained  as  the  mode 
of  the  measure  field  with  /J  =  8,  is  shown  in  panel  (b) 
■Note  that  additional  information  about  the  inter-class 
boundary  may  be  easily  incorporated,  improving  the  so¬ 
lution  without  having  to  recompute  the  measure  field,  for 
example,  if  one  wishes  to  reconstruct  the  surface  m  the 
nodes  of  a  square  lattice  L,  one  may  include  a  smooth¬ 
ness  constraint  on  these  boundaries  in  the  form  of  a  prior 
.Markov  Random  Field  (MRF)  model  for  the  shape  of  the 
smooth  patches  (see  [12])  Using,  for  instance,  a  first 
order  MRF  model  with  Ising  potentials,  one  may  obtain 
the  optimal  classification  {c, .  i  Q  L}  as  the  MPM  estima¬ 
tor  [14]  with  respect  to  the  posterior  Gibbs  distribution 
with  energy: 

=  Y^V{c,.Cj)  - 

with 


V(a,b)  =  -7,ifa  =  6 
=  7  ,  otherwise 

where  [i,^]  denotes  summation  over  all  nearest  neighbor 
pairs  of  sites  in  I.  7  is  a  parameter  and  xl{i)  is  the 
coordinate  vector  of  node  i 

Figure  1-c  shows  the  top  view  of  the  “true”  classi¬ 
fication:  panels  (d)  and  (e)  show  the  maximum  likeli¬ 
hood  classifications  (obtained  by  selecting  the  class  with 
largest  at  each  node)  for  /?  =  8  and  3  =  2,  respec¬ 
tively.  The  optimal  Bayesian  (MPM)  classification  (with 
7  =  10)  is  shown  in  panel  (f)  Note  that  in  all  cases  all 
the  examples  are  correctly  classified;  the  generalization 
performance  however,  will  clearly  be  superior  in  the  last 
case. 

Figure  1  around  here 

The  second  example  illustrates  the  system  perfor¬ 
mance  when  one  of  the  surface  patches  is  non-linear 
For  this,  we  use  the  function  studied  in  [10],  which  is 
illustratred  in  panel  (a)  of  figure  2  on  the  unit  square 
Again,  the  input  to  the  system  consisted  in  200  triplets 
{(x.,y,,x,))  with  (xi,y,)  selected  randomly.  The  system 
found  6  components  in  step  1  (after  merging),  and  21 
Gaussians  (with  covariance  matrices  equal  to  the  iden¬ 
tity)  in  the  6-way  classifier  of  step  2.  The  reconstructed 
surface  is  now  computed  as  the  mean  of  the  measure  field 
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(to  get  a  smooth  reconstruction)  It  is  shown  in  panels 
(b)  and  (c)  of  figure  2  for  J  =  10  and  3  =  2,  respectively 

Figure  2  around  here 

3.1  Image  Filtering  and  Segmentation 

An  important  class  of  problems  arises  when  one  has 
dense,  noisy  data  on  a  regular  lattice  (i.e.,  an  image),  so 
that  no  interpolation  is  needed.  In  this  case,  the  desired 
result  IS  just  the  output  of  step  1.  This  output  may  be 
interpreted  both  as  a  filtered  image  (since  the  fitted  lo¬ 
cal  models  smooth  out  the  noise)  and  as  a  segmentation, 
where  each  segment  corresponds  to  the  domaun  of  each 
local  model.  Since  there  is  an  observation  at  every 
point  where  one  wants  to  compute  the  approximation, 
one  may  also  construct  a  measure  field  representation 
by  putting; 

exp  [-d[llx.  -  mt||-  -b  A(j.  - 
pt  X,  -  exp[-A3[||i.  -  -p  A(c,  -  ijlr,))-]] 

for  each  center  k  (note  that  after  merging  one  may  have 
ik{x)  =  i](x)  if  centers  j  and  k  belong  to  the  same 
component).  By  setting  the  filtered  image  to  the  mean 
value  with  respect  to  this  measure  field,  one  may  use 
3  to  control  the  amount  of  inter-component  smoothing 
for  large  3  we  get  cin  edge-preserving  filter  and  for  small 
3  a  global  smoother.  The  performance  of  this  scheme 
for  a  piecewise  linear,  noisy  synthetic  image  is  shown  in 
figure  3  Note  the  correct  identification  of  the  3  linear 
components  (after  merging),  and  hence,  the  correct  seg¬ 
mentation,  in  spite  of  the  fact  that  the  intensity  gradient 
between  linear  regions  disappears  in  some  parts  of  the 
image.  The  shape  of  the  inter-class  boundaries  may  be 
made  smoother  (and  the  isolated  pixels  eliminated)  by 
using  a  Gibbsian  prior  and  computing  the  MPM  estima¬ 
tor  as  above. 

Figure  3  around  here 

A  similar  idea  may  be  used  for  image  quantization; 
suppose  one  wishes  to  represent  an  image  with  a  fixed, 
small  number  of  grey  levels.  In  this  case,  the  local  mod¬ 
els  axe  O-order  polynomials  (constants)  and  the  merging 
phase  IS  stopped  when  the  desired  number  of  components 
IS  reached.  The  result  is  illustrated  in  figure  4,  where  a 
detail  of  a  real,  noisy  image  is  iiuantized  to  5  levels  (a 
uniform  quzintization  is  also  included  for  comparison) 

Figure  4  around  here 

3.2  Time  Series  Prediction 

This  example  illustrates  the  system  performance  in  more 
than  2  dimensions  and  in  a  situation  where  the  basic 
model  assumptions  (piecewise  linear  structure  of  the  true 
function  and  additive  Gaussian  noise)  are  grossly  vio¬ 
lated.  as  we  show,  the  performance  is  still  reasonable  in 
this  case. 

The  task  is  to  predict  the  value  of  a  time  series  at 
some  future  time,  beised  in  4  previous  values.  To  be  able 
to  make  comparisons  with  other  approximation  meth¬ 
ods,  we  use  the  same  example  zis  in  [17];  the  chaotic 


•  •  •  • 


time  series  created  by  the  Macxey -Glass  delay -difference 
equation 

with  a  =  0  2,  6  =  01  and  r  =  17  • 

Each  4-dimensional  example  point  x,  is  fortr^d  by  the 
values  of  the  series  at  times;  Ti  —  6,  T,  —  12,  T,  —  12  and 
T,  —  18  The  corresponding  Zi  is  the  value  at  T,  -I-  85 
Using  a  training  set  of  1000  examples  the  local  robust 
regression  finds  26  components,  which  after  merging  are 
reduced  to  21  For  the  classification  phase,  a  simple 
Gaussian  cl^lssifier  (i.e.,  with  one  center  per  class  and  ® 

J  =  1)  was  used  (giving  3  8  %  classification  error)  The 
final  normalized  rms  error  is  0.12,  which  is  comparable 
to  the  accuracy  of  a  standard  Radial  Basis  Function  net¬ 
work  [18]  performing  the  same  task  (with  an  rms  error 
of  0  1),  as  reported  in  [17] 

The  most  accurate  algorithms  reported  there  a  I 

multi-layer  perceptron  trained  with  back  propagation 
and  a  particular  Resource-allocating  network,  achieved 
a  normalized  rms  of  0  03)  It  is  worth  noting,  how¬ 
ever,  that  the  computational  load  of  our  method  is  very 
light,  and  the  total  number  of  parameters,  which  is 
21  X  (4  2  -t-  10)  =  336,  makes  this  representation  more 

compact  than  all  others  reported  in  [17]  (all  use  more  ► 

than  500  parameters) 

4  Summary  and  Discussion 

We  have  presented  a  strategy  for  approximating  a  func¬ 
tion,  given  a  set  of  examples,  in  which  an  intermedi¬ 
ate  representation  — a  measure  field —  is  computed  in  • 

a  first  stage.  This  representation,  which  consists  in  a 
set  of  simple  local  models  with  associated  probabilities, 
has  the  advantage  of  allowing  for  globally  or  piecewise 
smooth  reconstructions  without  any  additional  compu¬ 
tational  effort;  also,  it  permits  the  incorporation  of  ad¬ 
ditional  information  about  the  reconstructed  function  ^ 

(e  g  ,  about  the  location  or  shape  of  the  discontinuities, 
etc.)  in  an  easy  way 

For  the  computation  of  this  representation,  we  pro¬ 
posed  a  procedure  in  which  the  local  models  are  found 
in  a  first  step,  and  induce  a  segmentation  of  the  data  (in 
terms  of  which  local  model  is  supported  by  each  data 
point)  The  associated  probabilities  are  then  computed  I 

in  a  second  step  as  the  discrirrunant  functions  of  this 
induced  classification  task 

This  procedure  has  the  advantage  of  combining  the 
computational  simplicity  of  decoupled  methods  for  find¬ 
ing  the  domain  of  each  local  model,  with  the  power  of 
making  the  local  model  selection  dependent  on  the  lo¬ 
cal  fit,  which  permits  the  use  of  simpler  models  that  aue 
easy  to  interpret  What  makes  it  specially  attractive  is 
the  existence  of  efficient  algorithms  for  performing  these 
computations;  these  algorithms  are  generalizations  of  the 
K-Means  procedure  for  vector  quantization,  with  the  ad¬ 
dition  of  terms  that  measure  the  goodness  of  fit  Here, 
they  were  derived  from  the  EM  algorithm  for  computing  i 

the  parameters  of  a  Gaussian  mixture  model 

An  important  problem  that  this  procedure  has  in  com¬ 
mon  with  all  methods  that  approximate  a  function  as 
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a  .  r'  "  H!(  .  oi  local  models  is  the  trade-off  between 
overall  capacity  and  generalization  capabilities  Here 
we  proposed  a  simple  statistical  criterion,  namely  the 
average  variance  of  the  mean  predicted  response  of  each 
model,  to  find  the  optimal  number  of  modeb 

It  b  abo  important  to  eliminate  redundtincies  after 
each  step  has  converged;  these  redundancies  are  caused, 
in  the  first  step,  by  the  fact  that  the  local  modeb  are 
spatially  localized  (so  that  a  hyperplane  that  extends 
over  a  large  region  of  the  input  space  will  be  broken  into 
several  components  to  find  a  good  approximation  to  its 
boundary).  In  the  second  step  there  are  Gaussians  that 
approximate  the  data  distribution  away  from  the  inter- 
cleiss  boundaries  and  are  thus  irrelevant  for  the  classifi¬ 
cation  task  In  the  proposed  scheme,  these  redundancies 
Me  eliminated  by  merging  similar  hyperplanes  (using  a 
criterion  based  on  the  F  statistic)  after  the  first  step 
and  by  pruning  irrelevant  components  after  the  second, 
this  last  method  has  the  advantage  of  producing  a  model 
for  the  relevance  of  the  examples,  which  may  be  used  to 
guide  the  gathering  of  additional  data 

The  results  of  the  experiments  we  have  performed  in¬ 
dicate  the  plausibility  of  this  approach  for  general  func¬ 
tion  approximation  tasks,  and  specially  for  surface  re¬ 
construction  problems  like  those  that  occur  in  image  pro¬ 
cessing  and  computational  vision 
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FIGURE  CAPTIONS 


Figure  1:  (a)  Original  surface,  (b)  Reconstructed  surface  from  200  exam¬ 
ples.  (c)  True  classification  (d)  Maximum  likelihood  classification  (3  =  8). 
(e)  Maximum  likelihood  classification  (J  =  2).  (f)  Bayesian  (MPM)  classifi¬ 
cation. 


Figure  2:  (a)  Original  surface,  (b)  Reconstructed  surface  from  200  examples 
(3  =  10).  (c)  Same  as  (b)  with  J  =  2. 


Figure  3:  (a)  Original  noisy  image,  (b)  Filtered  image,  (c)  Segmentation 
(each  grey  level  corresponds  to  a  linear  component). 


Figure  4:  (a)  Original  image,  (b)  Optimal  quantization,  (c)  Uniform  quan¬ 
tization. 
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